Similarity Measures for Smooth Web Page Classification
Authors
Abstract
This thesis examines the application of consistency learning techniques to the classification of hyperlinked web pages. Several similarity measures between web pages are defined, using local features such as textual content as well as features of the linked pages, with the link structure treated as a graph. The pairwise similarities are collected in similarity matrices. Each matrix can be used with consistency learning methods to make the classification smooth with respect to the data structure that the matrix reveals, which improves the accuracy of a simple text classifier. To further improve performance on the primary task of web page classification, a secondary machine learning problem is defined: finding the optimal combination of similarity matrices. Results on several hyperlinked text collections, including the well-known WebKB collection, show significantly better accuracy for the smooth learning methods than for plain text classification. The main novel contribution of this thesis is the definition and testing of various similarity measures between web pages, and the construction of a locally flexible similarity measure from heterogeneous data sources that improves classification accuracy on each of them. With little modification, these ideas may also be applied in other domains, such as bioinformatics and citation graph classification.
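The smoothing idea described in the abstract can be illustrated with a small sketch. It is not the thesis's own method: it assumes a standard iterative consistency scheme, F <- alpha * S * F + (1 - alpha) * Y with S the symmetrically normalized similarity matrix, cosine similarity for text, a binary link matrix, and fixed (not learned) combination weights. All function names, the toy data, and the parameter values are hypothetical.

```python
import math

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between feature vectors, with a zero diagonal."""
    n = len(vectors)
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                W[i][j] = dot / (norms[i] * norms[j])
    return W

def combine(matrices, weights):
    """Weighted sum of similarity matrices (weights are assumed here, not learned)."""
    n = len(matrices[0])
    return [[sum(w * M[i][j] for M, w in zip(matrices, weights))
             for j in range(n)] for i in range(n)]

def smooth_scores(W, Y, alpha=0.5, iterations=50):
    """Iterate F <- alpha * S F + (1 - alpha) * Y, where S = D^-1/2 W D^-1/2."""
    n, k = len(W), len(Y[0])
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iterations):
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
              for c in range(k)] for i in range(n)]
    return F

def predict(scores):
    """Index of the highest-scoring class for each page."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in scores]

# Toy data (hypothetical): pages 0 and 1 belong to class 0, pages 2 and 3 to class 1.
text_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
W_text = cosine_similarity_matrix(text_vectors)
W_link = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # 1 = pages are linked
W = combine([W_text, W_link], [0.5, 0.5])

# Scores from a plain text classifier; page 1 is scored slightly toward the wrong class.
Y = [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
F = smooth_scores(W, Y)

print(predict(Y))  # labels before smoothing
print(predict(F))  # labels after smoothing
```

In this toy setup the plain classifier labels page 1 as class 1, while the smoothed scores pull it toward its similar and linked neighbour, page 0, so it is labelled class 0. Choosing the combination weights is the secondary learning problem the abstract mentions; here they are simply fixed at 0.5 each.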
Similar resources
A Comparative Study of Web-pages Classification Methods using Fuzzy Operators Applied to Arabic Web-pages
In this study, a fuzzy similarity approach for Arabic web pages classification is presented. The approach uses a fuzzy term-category relation by manipulating membership degree for the training data and the degree value for a test web page. Six measures are used and compared in this study. These measures include: Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and Bounded Difference ap...
A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification
In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
An adaptive neural network approach to hypertext clustering
The WWW is an on-line hypertextual collection, and a more sophisticated algorithm for Web page clustering may have to be based on combined term-similarity and hyperlink-similarity measures. It has been observed that nearly all currently employed techniques for document classification on the Web make use of textual information only. In addition, most of these techniques are incapable of discover...
Semantic similarity based web document classification using support vector machine
With the rapid growth of information on the World Wide Web (WWW), classification of web documents has become important for efficient information retrieval. Relevancy of information retrieved can also be improved by considering semantic relatedness between words which is a basic research area in fields of natural language processing, intelligent retrieval, document clustering and classification,...
Impact of Similarity Measures on Web-page Clustering
Clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possi...